About The Project
Executive Summary

This study investigates the factors influencing the price and cut quality of diamonds. By analyzing a comprehensive dataset, we aim to determine how attributes such as carat, color, clarity, depth, and table affect diamond prices. Additionally, we examine which features significantly impact the quality of a diamond’s cut. These insights can guide consumers and industry professionals in making informed decisions about diamond valuation and quality assessment.

The Problem Description

This project examines the price and cut of diamonds using both regression and classification analysis. We divide the data into training (80%) and testing (20%) datasets. The goal of the regression models is to predict the price of a diamond from the predictor variables in the dataset; for this analysis, we look for relationships between price and the other variables using Linear Regression, a Bagged Tree, and a Random Forest. We then perform classification analysis, predicting whether a diamond has a High or Acceptable cut, using Logistic Regression, a Random Forest, and a Gradient Boosted model. Finally, we summarize our conclusions about which variables help predict the price and cut of a diamond.
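The 80/20 split described above can be sketched as follows. The report's analysis is in R; this is an equivalent sketch in Python/scikit-learn using simulated stand-in data, not the actual diamonds dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-in for the diamonds data (simulated values, not the real dataset)
rng = np.random.default_rng(42)
carat = rng.uniform(0.2, 4.1, size=(1000, 1))
price = 10000 * carat[:, 0] + rng.normal(0, 500, size=1000)

# 80% training / 20% testing, as in the report
X_train, X_test, y_train, y_test = train_test_split(
    carat, price, test_size=0.2, random_state=1
)
print(len(X_train), len(X_test))  # 800 200
```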

The Data

This dataset has 53,940 rows and 10 variables. Given its size, I sampled 25% of the observations (13,485 rows) for better performance and speed.
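The 25% sample can be sketched as a draw of row indices without replacement (a Python illustration of the step; the report itself uses R):

```python
import numpy as np

# draw a 25% sample of row indices without replacement
n_total = 53940
rng = np.random.default_rng(0)
sample_idx = rng.choice(n_total, size=n_total // 4, replace=False)
print(len(sample_idx))  # 13485
```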

Data Sources

Kaggle: Diamonds (link: https://www.kaggle.com/datasets/shivam2503/diamonds?resource=download)

Variables
Predictor variables
  • carat: Carat weight of the diamond
  • color: Color quality of the diamond
  • clarity: The diamond’s level of obvious inclusions
  • length: Length of the diamond
  • width: Width of the diamond
  • depth: Height of the diamond
  • table_perc: Width of the diamond's table, expressed as a proportion of its average diameter
  • depth_perc: Height of the diamond from the culet to the table, expressed as a proportion of its average girdle diameter
Response variables
  • price: Price of the diamond
  • cut: Cut quality of the diamond
Data Overview

From this data we can see that our variables take a variety of values depending on their types. carat has a mean of 0.79 but a max of 4.13. Several variables have a wide range of values, most noticeably price, which ranges from 336 to 18,823, and width, which ranges from 0 to 9.94. There may also be high correlation between some variables, for example depth and depth_perc, so we will likely remove one of them in the analysis. For our target variable cut, the value is High if the cut is Premium or Ideal, and Acceptable if it is Fair, Good, or Very Good.
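The recoding of cut into two classes can be sketched as a simple mapping (illustrative Python; the report does this in R):

```python
# collapse the five original cut grades into the two target classes
def binarize_cut(grade: str) -> str:
    return "High" if grade in {"Premium", "Ideal"} else "Acceptable"

grades = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
print([binarize_cut(g) for g in grades])
# ['Acceptable', 'Acceptable', 'Acceptable', 'High', 'High']
```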

View the Data Summaries

Now we can see the range of values for each variable.

     carat                cut          color        clarity         price      
 Min.   :0.2000   High      :8873   Tier 1:4134   Tier 1:2633   Min.   :  336  
 1st Qu.:0.4000   Acceptable:4612   Tier 2:5275   Tier 2:5070   1st Qu.:  954  
 Median :0.7000                     Tier 3:4076   Tier 3:5782   Median : 2381  
 Mean   :0.7952                                                 Mean   : 3904  
 3rd Qu.:1.0400                                                 3rd Qu.: 5274  
 Max.   :4.1300                                                 Max.   :18823  
     length           width           depth         table_perc    
 Min.   : 0.000   Min.   :0.000   Min.   :0.000   Min.   :0.4300  
 1st Qu.: 4.710   1st Qu.:4.720   1st Qu.:2.910   1st Qu.:0.5600  
 Median : 5.690   Median :5.700   Median :3.520   Median :0.5700  
 Mean   : 5.725   Mean   :5.727   Mean   :3.536   Mean   :0.5743  
 3rd Qu.: 6.530   3rd Qu.:6.530   3rd Qu.:4.030   3rd Qu.:0.5900  
 Max.   :10.020   Max.   :9.940   Max.   :6.430   Max.   :0.9500  
   depth_perc    
 Min.   :0.4300  
 1st Qu.:0.6110  
 Median :0.6190  
 Mean   :0.6176  
 3rd Qu.:0.6250  
 Max.   :0.7220  
Average Price by Cut
cut n mean(price)
High 8873 3865.20
Acceptable 4612 3977.29
Average Price by Color
color n mean(price)
Tier 1 4134 3091.86
Tier 2 5275 3871.84
Tier 3 4076 4767.78
Average Price by Clarity
clarity n mean(price)
Tier 1 2633 2894.31
Tier 2 5070 3872.72
Tier 3 5782 4390.13
Response Variables' Relationships with Predictors
  • About 66% of the observations (8,873 of 13,485) are categorized as 'High' in cut quality. Looking at potential relationships, the strongest are with carat and length.

  • The largest concentration of diamond prices is around $0-$5,000, and the distribution is skewed to the right. Looking at potential relationships, we see strong relationships between price and carat, length, width, and depth, suggesting these variables affect the price of a diamond.

  • The higher-than-average correlation between certain variables (for example, width and length) may be a sign of multicollinearity, which we will address later in the analysis.
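The multicollinearity concern can be illustrated with a quick correlation check on simulated measurements (Python sketch; the report uses R):

```python
import numpy as np

# simulate two nearly identical measurements, as width and length are in this data
rng = np.random.default_rng(3)
length = rng.uniform(4, 7, size=500)
width = length + rng.normal(0, 0.05, size=500)

r = np.corrcoef(length, width)[0, 1]
print(r > 0.95)  # a correlation this high flags potential multicollinearity
```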

Distribution of diamond cut
Histogram of Diamond Price
Training and Testing Datasets

Here is a look at the training and testing datasets:

Dataset Number_of_Obs
Training 10787
Testing 2698
Regression: Predicting Diamond Price

Here is a look at a regression model predicting price

term estimate std.error statistic p.value
(Intercept) 10637.883 1125.416 9.452 0.000
carat 10138.990 125.123 81.032 0.000
cutAcceptable -180.678 29.306 -6.165 0.000
colorTier 2 -250.018 30.609 -8.168 0.000
colorTier 3 -1139.673 33.367 -34.156 0.000
clarityTier 2 -624.855 35.696 -17.505 0.000
clarityTier 3 -1749.819 36.987 -47.308 0.000
length -1396.654 182.366 -7.659 0.000
width 630.738 158.448 3.981 0.000
depth 120.892 198.791 0.608 0.543
table_perc -5197.107 658.126 -7.897 0.000
depth_perc -10311.035 1510.877 -6.825 0.000
Model RMSE RSquare MAE
Linear Regression 1251.52 0.902 818.497
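The fit-and-score step behind the table above can be sketched as follows (an illustrative Python/scikit-learn version with simulated data; the report fits the model in R):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# simulated carat -> price relationship (illustrative, not the report's data)
rng = np.random.default_rng(7)
carat = rng.uniform(0.2, 4.1, size=(500, 1))
price = 10000 * carat[:, 0] + rng.normal(0, 800, size=500)

model = LinearRegression().fit(carat, price)
pred = model.predict(carat)

# the three metrics used throughout the report
rmse = mean_squared_error(price, pred) ** 0.5
r2 = r2_score(price, pred)
mae = mean_absolute_error(price, pred)
print(r2 > 0.9, mae <= rmse)  # MAE never exceeds RMSE
```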
Logistic Regression: Predicting Cut

Here is a look at a logistic regression model predicting diamond cut

term estimate std.error statistic p.value
(Intercept) -42.594 5.919 -7.196 0.000
carat 0.652 0.329 1.986 0.047
colorTier 2 -0.116 0.057 -2.015 0.044
colorTier 3 -0.266 0.066 -4.019 0.000
clarityTier 2 -0.013 0.069 -0.195 0.846
clarityTier 3 0.110 0.077 1.424 0.154
price 0.000 0.000 -7.207 0.000
length -16.123 0.719 -22.411 0.000
width 11.907 0.688 17.303 0.000
depth 7.238 1.587 4.561 0.000
table_perc 47.349 1.322 35.820 0.000
depth_perc 21.403 9.427 2.270 0.023
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.9 0.502 0.776 0.795
  • Up to a price of about $17,500, the residuals look like a "blob" and don't form any clear pattern. Beyond $17,500, however, we see a curve as the model tries to accommodate the outliers. I wouldn't recommend using this model for prices over $17,500.
Best_Cutoff Sensitivity Specificity AUC
0.632 0.787 0.672 0.795
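The report does not state how the best cutoff is chosen; a common rule, assumed here, is to maximize Youden's J (sensitivity + specificity - 1) over the ROC curve. A Python sketch with simulated scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

# simulated class labels and predicted scores (illustrative)
rng = np.random.default_rng(11)
y_true = rng.integers(0, 2, size=1000)
scores = 0.6 * y_true + 0.8 * rng.random(1000)

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                          # Youden's J = sensitivity + specificity - 1
best_cutoff = thresholds[np.argmax(j)]
print(j.max() > 0.5)
```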
Pruning the model
  • We can see that depth is not statistically significant (p = 0.543 > 0.05), so we will prune it from the model.
term estimate std.error statistic p.value
(Intercept) 10232.058 906.195 11.291 0
carat 10141.535 125.049 81.100 0
cutAcceptable -179.319 29.220 -6.137 0
colorTier 2 -250.021 30.609 -8.168 0
colorTier 3 -1139.869 33.365 -34.164 0
clarityTier 2 -625.028 35.694 -17.511 0
clarityTier 3 -1750.403 36.974 -47.342 0
length -1336.464 153.167 -8.726 0
width 643.988 156.938 4.103 0
table_perc -5218.762 657.142 -7.942 0
depth_perc -9625.903 1006.697 -9.562 0
Model RMSE RSquare MAE
Linear Regression 1251.520 0.902 818.497
Linear Reg., Prune Depth 1251.294 0.902 818.318
Transform Our Dependent Variable
  • Now that all of our predictor variables are statistically significant, we try transforming the target variable. We use log(price) because the price column is skewed to the right.
term estimate std.error statistic p.value
(Intercept) -2.481 0.136 -18.258 0
carat -0.860 0.019 -45.874 0
cutAcceptable -0.058 0.004 -13.195 0
colorTier 2 -0.084 0.005 -18.301 0
colorTier 3 -0.283 0.005 -56.652 0
clarityTier 2 -0.205 0.005 -38.314 0
clarityTier 3 -0.441 0.006 -79.483 0
length 0.215 0.023 9.366 0
width 1.094 0.024 46.486 0
table_perc 0.905 0.099 9.184 0
depth_perc 5.399 0.151 35.768 0
Model RMSE RSquare MAE
Linear Regression 1251.520 0.902 818.497
Linear Reg., Prune Depth 1251.294 0.902 818.318
Linear Reg., Log Price 5547.157 0.781 3854.140
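The log-transform step can be sketched as follows. Note that, to keep RMSE and MAE comparable with the untransformed models, predictions from the log model should be exponentiated back to dollars before scoring (I assume that is how the metrics above were computed). A Python illustration with simulated right-skewed prices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# simulated right-skewed prices (illustrative)
rng = np.random.default_rng(5)
carat = rng.uniform(0.2, 2.5, size=(400, 1))
price = np.exp(6.0 + 1.5 * carat[:, 0] + rng.normal(0, 0.2, size=400))

model = LinearRegression().fit(carat, np.log(price))   # fit on log(price)
pred_price = np.exp(model.predict(carat))              # back-transform to dollars
rmse = np.sqrt(np.mean((price - pred_price) ** 2))     # RMSE on the dollar scale
print(np.isfinite(rmse) and rmse > 0)
```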
Pruning the model - Clarity
  • We can see that clarityTier 2 is not statistically significant (p > 0.05), so we will prune it from the model.
term estimate std.error statistic p.value
(Intercept) -42.558 5.915 -7.195 0.000
carat 0.650 0.328 1.978 0.048
colorTier 2 -0.115 0.057 -2.006 0.045
colorTier 3 -0.265 0.066 -4.018 0.000
clarityTier 3 0.120 0.055 2.180 0.029
price 0.000 0.000 -7.271 0.000
length -16.126 0.719 -22.422 0.000
width 11.905 0.688 17.305 0.000
depth 7.243 1.586 4.566 0.000
table_perc 47.337 1.320 35.855 0.000
depth_perc 21.354 9.421 2.267 0.023
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.900 0.502 0.776 0.795
Logistic Reg. - Prune Clarity 0.762 0.899 0.499 0.775 0.795
Pruning the model - Carat
  • Next, let's prune carat, which is only marginally significant (p = 0.048).
term estimate std.error statistic p.value
(Intercept) -41.757 5.739 -7.276 0.000
colorTier 2 -0.111 0.057 -1.947 0.052
colorTier 3 -0.229 0.063 -3.612 0.000
clarityTier 3 0.146 0.053 2.736 0.006
price 0.000 0.000 -8.025 0.000
length -16.086 0.711 -22.636 0.000
width 11.679 0.668 17.481 0.000
depth 7.836 1.519 5.157 0.000
table_perc 47.578 1.315 36.175 0.000
depth_perc 18.810 9.079 2.072 0.038
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.900 0.502 0.776 0.795
Logistic Reg. - Prune Clarity 0.762 0.899 0.499 0.775 0.795
Logistic Reg. - Prune Carat 0.763 0.901 0.497 0.775 0.794
Pruning the model - Width
  • Next, let's try pruning width, which is highly correlated with length, to see how the model performs without it.
term estimate std.error statistic p.value
(Intercept) 25.393 4.208 6.034 0.000
colorTier 2 -0.117 0.056 -2.090 0.037
colorTier 3 -0.223 0.062 -3.594 0.000
clarityTier 3 0.137 0.052 2.611 0.009
price 0.000 0.000 -6.653 0.000
length -15.069 0.734 -20.517 0.000
depth 24.919 1.195 20.851 0.000
table_perc 43.532 1.258 34.612 0.000
depth_perc -85.174 6.822 -12.485 0.000
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.900 0.502 0.776 0.795
Logistic Reg. - Prune Clarity 0.762 0.899 0.499 0.775 0.795
Logistic Reg. - Prune Carat 0.763 0.901 0.497 0.775 0.794
Logistic Reg. - Prune Width 0.750 0.903 0.456 0.761 0.767
Model Description
  • For this part of the analysis, I fit a tuned Bootstrap Aggregating (Bagged) model to predict the price.
  • I use 5-fold cross-validation, so the training data is split into 5 equally sized folds. I set the number of trees to 500, and the model tests minimum node sizes (min_n) of 5, 15, and 25. Similarly, the model tests maximum tree depths of 5, 6, 7, and 8.
  • I then let R choose the parameters that maximize performance, using the default metric, RMSE.
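A bagged tree is a random forest that considers every predictor at each split. The tuning loop described above might look like this in scikit-learn (illustrative simulated data, a smaller tree count for speed, and min_samples_split as a rough analogue of tidymodels' min_n):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# simulated regression data (illustrative)
rng = np.random.default_rng(21)
X = rng.uniform(0, 1, size=(300, 6))
y = X @ np.array([5.0, 3.0, 0.0, 0.0, 1.0, 2.0]) + rng.normal(0, 0.1, size=300)

# bagging: max_features=None makes every split consider all predictors
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, max_features=None, random_state=0),
    param_grid={"min_samples_split": [5, 15, 25], "max_depth": [5, 6, 7, 8]},
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # RMSE as the selection metric
)
search.fit(X, y)
print(search.best_params_)
```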
Setting up the model
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = .preds()
  trees = 500
  min_n = 5

Engine-Specific Arguments:
  importance = impurity
  max.depth = 8

Computational engine: ranger 
Variable Importance Plot
Model RMSE RSquare MAE
Linear Regression 1251.520 0.902 818.497
Linear Reg., Prune Depth 1251.294 0.902 818.318
Linear Reg., Log Price 5547.157 0.781 3854.140
Tuned Bagged Model 853.026 0.954 455.336
Model Description
  • For this part of the analysis, I fit a tuned Random Forest model to predict the price.
  • I use 5-fold cross-validation, so the training data is split into 5 equally sized folds. I set the number of trees to 500, and the model tests minimum node sizes (min_n) of 5, 15, and 25 as well as maximum tree depths of 5, 6, 7, and 8. The number of variables randomly sampled at each split (mtry) ranges from 4 to 7.
  • I let R choose the parameters that maximize performance, using the default metric, RMSE.
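The random-forest search space described above amounts to a full crossing of the three tuning parameters (a small Python illustration):

```python
from itertools import product

# the tuned random-forest search space described above
mtry = [4, 5, 6, 7]          # variables sampled at each split
min_n = [5, 15, 25]          # minimum node size
max_depth = [5, 6, 7, 8]     # maximum tree depth

grid = list(product(mtry, min_n, max_depth))
print(len(grid))  # 48 candidate combinations
```

The winning combination reported below (mtry = 7, min_n = 5, max.depth = 8) is one of these 48 candidates.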
Setting up the model
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
price ~ .

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 7
  trees = 500
  min_n = 5

Engine-Specific Arguments:
  importance = impurity
  max.depth = 8

Computational engine: ranger 
Variable Importance Plot
Model RMSE RSquare MAE
Linear Regression 1251.520 0.902 818.497
Linear Reg., Prune Depth 1251.294 0.902 818.318
Linear Reg., Log Price 5547.157 0.781 3854.140
Tuned Bagged Model 853.026 0.954 455.336
Tuned Random Forest Model 851.051 0.955 453.298
Model Description
  • For this part of the analysis, I fit a Random Forest model to predict the cut of a diamond.
  • Due to performance issues, I couldn't afford to tune this model. Instead, I fix the number of variables randomly selected at each split (mtry) at 5, the number of trees (trees) at 500, the minimum node size (min_n) at 15, and the maximum tree depth (max.depth) at 7.
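The fixed-hyperparameter forest can be sketched in scikit-learn like this (simulated data; min_samples_leaf stands in for ranger's min.node.size, and oob_score mirrors the OOB error reported below):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# simulated 9-predictor classification problem (illustrative)
rng = np.random.default_rng(9)
X = rng.uniform(0, 1, size=(500, 9))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.2, size=500) > 1.0).astype(int)

# fixed hyperparameters mirroring mtry = 5, trees = 500, min_n = 15, max.depth = 7
clf = RandomForestClassifier(
    n_estimators=500,
    max_features=5,
    min_samples_leaf=15,
    max_depth=7,
    oob_score=True,   # out-of-bag accuracy, analogous to ranger's OOB error
    random_state=0,
).fit(X, y)
print(clf.oob_score_ > 0.7)
```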
Setting up the model
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~5,      x), num.trees = ~500, min.node.size = min_rows(~15, x), importance = ~"impurity",      max.depth = ~7, num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  500 
Sample size:                      10787 
Number of independent variables:  9 
Mtry:                             5 
Target node size:                 15 
Variable importance mode:         impurity 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.1216869 
Variable Importance Plot
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.900 0.502 0.776 0.795
Logistic Reg. - Prune Clarity 0.762 0.899 0.499 0.775 0.795
Logistic Reg. - Prune Carat 0.763 0.901 0.497 0.775 0.794
Logistic Reg. - Prune Width 0.750 0.903 0.456 0.761 0.767
Classification Random Forest 0.845 0.970 0.605 0.825 0.878
Best_Cutoff Sensitivity Specificity AUC
0.668 0.895 0.724 0.878
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.900 0.502 0.776 0.795
Logistic Reg. - Prune Clarity 0.762 0.899 0.499 0.775 0.795
Logistic Reg. - Prune Carat 0.763 0.901 0.497 0.775 0.794
Logistic Reg. - Prune Width 0.750 0.903 0.456 0.761 0.767
Classification Random Forest 0.845 0.970 0.605 0.825 0.878
Classification Random Forest with Cutoff 0.837 0.895 0.724 0.862 0.878
Model Description
  • For this part of the analysis, we fit a tuned Gradient Boosted (XGBoost) model to predict the cut of a diamond.
  • We build a grid of hyperparameters using a Latin hypercube sampling strategy: class_xg_grid is a tibble with 10 rows, each row representing a unique combination of the specified hyperparameters.
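Latin hypercube sampling spreads the 10 candidate points so each hyperparameter's range is covered evenly, one point per stratum. A Python sketch with SciPy, over two illustrative hyperparameter ranges (the actual ranges used in the report are not stated here):

```python
from scipy.stats import qmc

# 10-point Latin hypercube over two illustrative XGBoost hyperparameters
sampler = qmc.LatinHypercube(d=2, seed=0)
unit = sampler.random(n=10)   # 10 points in the unit square, one per stratum
grid = qmc.scale(unit, l_bounds=[1, 0.001], u_bounds=[12, 0.3])  # tree_depth, learn_rate
print(grid.shape)  # (10, 2)
```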
Setting up the model
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Formula
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
cut ~ .

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = 5
  trees = 500
  min_n = 15
  tree_depth = 11
  learn_rate = 0.00339794601607989
  loss_reduction = 2.21670407129474e-10

Computational engine: xgboost 
Variable Importance Plot
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.900 0.502 0.776 0.795
Logistic Reg. - Prune Clarity 0.762 0.899 0.499 0.775 0.795
Logistic Reg. - Prune Carat 0.763 0.901 0.497 0.775 0.794
Logistic Reg. - Prune Width 0.750 0.903 0.456 0.761 0.767
Classification Random Forest 0.845 0.970 0.605 0.825 0.878
Classification Random Forest with Cutoff 0.837 0.895 0.724 0.862 0.878
Classification Gradient Boosted 0.854 0.972 0.626 0.833 0.895
Best_Cutoff Sensitivity Specificity AUC
0.646 0.86 0.776 0.895
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.764 0.900 0.502 0.776 0.795
Logistic Reg. - Prune Clarity 0.762 0.899 0.499 0.775 0.795
Logistic Reg. - Prune Carat 0.763 0.901 0.497 0.775 0.794
Logistic Reg. - Prune Width 0.750 0.903 0.456 0.761 0.767
Classification Random Forest 0.845 0.970 0.605 0.825 0.878
Classification Random Forest with Cutoff 0.837 0.895 0.724 0.862 0.878
Classification Gradient Boosted 0.854 0.972 0.626 0.833 0.895
Classification Gradient Boosted with Cutoff 0.831 0.860 0.776 0.881 0.895
Numerical Variable

To predict price, here are the most important variables:

  • Width
  • Carat
  • Clarity
  • Length
  • Color

For price, how big, bright, and clear the diamond is has a big impact!

Categorical Variable

To predict cut, here are the most important variables:

  • Length
  • Width
  • Depth
  • Depth_Perc
  • Table_Perc

We can see that the diamond's physical measurements are what drive the prediction of its cut.

Regression Result Table
  • Compared to the Linear Regression models, the Bagged Model and Random Forest Model both perform better. While they have nearly identical R-squares, the Tuned Random Forest model has a slightly lower MAE and RMSE, so I would recommend it for this analysis.
Model RMSE RSquare MAE
Linear Regression 1251.52 0.90 818.50
Linear Reg., Prune Depth 1251.29 0.90 818.32
Linear Reg., Log Price 5547.16 0.78 3854.14
Tuned Bagged Model 853.03 0.95 455.34
Tuned Random Forest Model 851.05 0.95 453.30
Actual vs Predicted Plot
Classification Result Table
  • For cut prediction, I would recommend the Gradient Boosted model due to its high sensitivity, precision, and AUC. The model caught 86%-97% of the true positives, and of the positives it predicted, 83%-88% were correct (depending on whether the best cutoff is used).
Model Accuracy Sensitivity Specificity Precision AUC
Logistic Regression 0.76 0.90 0.50 0.78 0.80
Logistic Reg. - Prune Clarity 0.76 0.90 0.50 0.78 0.80
Logistic Reg. - Prune Carat 0.76 0.90 0.50 0.78 0.79
Logistic Reg. - Prune Width 0.75 0.90 0.46 0.76 0.77
Classification Random Forest 0.84 0.97 0.60 0.83 0.88
Classification Random Forest with Cutoff 0.84 0.90 0.72 0.86 0.88
Classification Gradient Boosted 0.85 0.97 0.63 0.83 0.90
Classification Gradient Boosted with Cutoff 0.83 0.86 0.78 0.88 0.90
ROC Curve Plot
What did you work hardest on or are you most proud of in this project?
  • My biggest challenge in this project was re-categorizing my categorical variables. The clarity, color, and cut columns have many categories, and it was difficult to determine how to re-categorize them so that each column had only 2 to 3 categories.
  • Determining the tuning parameters of my Random Forest and XGBoost models also posed a problem, because there are so many parameters to try out.
  • What I'm most proud of in this project is that, even though I ran into speed issues when running my tuned models, I still produced models that predict my target variables well.
If I had another week to work on this project

If I had another week to work on this project, I would love to try more of the models we covered in class, such as Lasso and SVC. I would also try normalizing and balancing the data set, as we did in homework cases #2 and #3, to see whether transforming the data has any impact on model performance.